Classifiers

Douwe Molenaar

Systems Biology Lab

2024-11-01

Introduction

The response may be a categorical variable

In previous models we had

\[y = f(\vec{x}, \vec{\beta})\] With response \(y\) being a continuous variable and \(f(\cdot)\) being linear in the parameters \(\beta_i\).

  • This is called regression (linear or non-linear, depending on how \(x_j\) appears in \(f\))
  • When \(y\) is a categorical variable, we call such a model a classifier.
  • The outcome \(y\) is
    • a single class label
    • a probability distribution over the class labels

Example of \(y\) being a categorical variable

  • \(y\) is health status, for example ill or healthy
  • \(\vec{x}\) could be categorical or continuous predictor variables, like Weight, Gender, Smoking, Age, etc.
  • The model \(y = f(\vec{x}, \vec{\beta})\) could, after fitting to data to obtain estimates \(\hat{\vec{\beta}}\), predict:
    • A single class, ill or healthy
    • The probabilities \(\theta\) and \(1-\theta\) of a person being in the classes ill or healthy, respectively

The Iris data set

  • Three species of iris: setosa, virginica and versicolor
  • 4 characteristics measured: length and width of two leaf types (sepals and petals)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          6.3         3.4          5.6         2.4  virginica
2          4.9         3.1          1.5         0.1     setosa
3          5.0         3.3          1.4         0.2     setosa
4          5.0         3.4          1.5         0.2     setosa
5          6.0         3.4          4.5         1.6 versicolor

Predicting Iris Species

  • Using a subset of the data
    • only versicolor and virginica
  • \(y\) Iris species:
    • versicolor (=0) and virginica (=1)
    • added vertical jitter to visualize individual measurements
  • \(x\) is Sepal.Length (numeric)

Fitting using linear regression

The line represents a linear regression. Not very useful as a predictor of the probability of being species 1 or 0, because it can assume values \(>1\) and \(<0\).


However, we could make a classifier with the following decision rule:

\[ \text{species} = \begin{cases} \text{versicolor} & \text{if } f(\text{Sepal.Length},\hat{\beta_0},\hat{\beta_1}) \leq 0.5 \\ \text{virginica} & \text{if } f(\text{Sepal.Length},\hat{\beta_0},\hat{\beta_1}) > 0.5 \end{cases} \]
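As a sketch in R, this threshold rule can be applied to a straight-line fit; `iris2` and `y` are hypothetical names for the two-species subset and its 0/1 species code:

```r
# iris2: only versicolor (y = 0) and virginica (y = 1) rows of iris
iris2 <- subset(iris, Species != "setosa")
iris2$y <- as.numeric(iris2$Species == "virginica")

lin.model <- lm(y ~ Sepal.Length, data = iris2)   # straight-line fit
fitted.p  <- predict(lin.model)                   # may fall outside (0, 1)
predicted <- ifelse(fitted.p > 0.5, "virginica", "versicolor")
```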

Logistic regression

A better approach

Modeling the distributions of Sepal.Length


  • The smooth curves are normal distributions fitted to the histograms
    • These are fitted assuming equal standard deviations

Deducing probabilities from histograms and their fits


  • Dots are ratios of bin counts \(\frac{\text{virginica}}{\text{virginica}+\text{versicolor}}\)
  • The curve is the ratio of the fitted normal distributions.

The log-odds ratio or logit

The log-odds ratio is the logarithm of the ratio of the probability of being a virginica, \(\theta\), over that of being a versicolor, \(1-\theta\)

\[ \text{log-odds} = \ln{\left( \frac{\theta}{1-\theta} \right)} \]


Note that

  • the range of the log-odds ratio is \((-\infty, \infty)\)
  • \(\text{log-odds}=0\) when \(\theta = 1 - \theta\) (equal probability)

Surprise!


The log-odds ratio of the probability fitted on the iris data is a straight line.

Idea behind logistic regression

Apparently, the log-odds of the parameter \(\theta(x)\) can be modeled as a straight line in the predictor variable \(x=\) Sepal.Length:

\[ \ln{\left( \frac{\theta(x)}{1-\theta(x)} \right)} = \beta_0 + \beta_1 x \]

Equivalently

\[\theta(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}\]
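The equivalence follows by exponentiating the log-odds equation and solving for \(\theta(x)\):

\[ \frac{\theta(x)}{1-\theta(x)} = e^{\beta_0 + \beta_1 x} \quad\Rightarrow\quad \theta(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]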

The full model, including binomial distribution of samples

When observing an iris plant with sepal length \(x\), we have:

  • a virginica with probability \(\theta(x)\)
  • a versicolor with probability \(1 - \theta(x)\)

The probability of observing \(k\) virginica individuals among \(n\) iris plants having a sepal length equal to \(x\) equals:

\[ p(k\text{ virginica }|n) = \text{Binom}(n,k,\theta) = \binom{n}{k} \theta^k (1 - \theta)^{n-k} \]

where

\[ \theta = \theta(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]
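In R this binomial probability is available as `dbinom()`; the values \(\theta = 0.7\), \(n = 10\), \(k = 7\) below are made up for illustration:

```r
theta <- 0.7
dbinom(7, size = 10, prob = theta)         # P(7 virginica out of 10 | theta)
choose(10, 7) * theta^7 * (1 - theta)^3    # identical, by the explicit formula
```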

Logistic regression in R

Fit a line of the type \(y = \frac{1}{1 + e^{-f(x)}}\) where \(f(x)\) is a linear function of \(x\).

The range of \(f(x)\) is \((-\infty, +\infty)\), but that of \(y\) is \((0,1)\)

# assuming `iris2` (a name not shown on the slide) holds only the
# versicolor and virginica rows of iris
log.model <- glm(Species ~ Sepal.Length, 
                 data = iris2,
                 family = 'binomial')

How well does it fit the training data?

We classify as follows:

  • if \(p\leq0.5\) then species = 0
  • if \(p>0.5\) then species = 1
            Prediction
Species      versicolor virginica
  versicolor         36        14
  virginica          13        37

The training error of this classifier is 27% (27 misclassified out of 100)
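A sketch reproducing the rule and the confusion table; `iris2` is a hypothetical name for the versicolor/virginica subset:

```r
# restrict to versicolor and virginica; drop the unused setosa level
iris2 <- droplevels(subset(iris, Species != "setosa"))
log.model <- glm(Species ~ Sepal.Length, data = iris2, family = 'binomial')

p    <- predict(log.model, type = 'response')       # fitted P(virginica)
pred <- ifelse(p > 0.5, 'virginica', 'versicolor')

table(Species = iris2$Species, Prediction = pred)   # confusion matrix
mean(pred != iris2$Species)                         # training error
```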

But wait, we have more predictors than Sepal.Length

Namely also Sepal.Width, Petal.Width and Petal.Length

# again assuming `iris2` holds only the versicolor and virginica rows
full.log.model <- glm(Species ~ Petal.Width + 
                      Petal.Length + 
                      Sepal.Width + 
                      Sepal.Length, 
                      data = iris2,
                      family = 'binomial')


The training error equals only 2%!

Intermezzo: Generalized Linear Modeling

Generalized linear modeling

Logistic regression in R is implemented using the glm() function for Generalized Linear Modeling

GLM is used when

  • Response variables are not normally distributed but come from an exponential family
    • Poisson, binomial, gamma, etc.
  • Expected values of response variables can be expressed as linear combinations \(f(x)\) of predictor variables (\(x\)) through a link function \(g()\)

\[\mathbb{E}[y] = g^{-1}(f(x))\]

Logistic regression as a case of GLM

  • In logistic regression the response variable is binomially distributed.
  • The link function is the logit function (log-odds)

\[g(\theta) = \ln{\left( \frac{\theta}{1 - \theta} \right)}\]

  • Hence the call glm(Species ~ Sepal.Length, family='binomial')
    • The proper link function is automatically chosen depending on the distribution
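This can be checked interactively: R's family objects expose their link function:

```r
binomial()$link   # "logit", the canonical link chosen by glm()
poisson()$link    # "log"
```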


See Chapter 19 for more background on maximum likelihood fitting.

Linear Discriminant Analysis

Linear discriminant analysis

The idea behind Discriminant Analysis

  • Model the probability density distributions of each of the classes as normal distributions
  • If \(p(y=A|x) > p(y=B|x)\) then classify the object as \(A\), otherwise as \(B\)

Here \(y\) is the species label and \(x\) is the variable Sepal.Length

Linear discriminant analysis in 2 dimensions

  • Use multivariate normal distributions
  • Assume the same covariance for all classes
    • Decision boundaries will be straight lines, planes or hyperplanes


Question

How will the fractions of the species in the training set affect the decision boundaries of the LDA predictor?

LDA in R

library(MASS)   # provides lda()
ldam <- lda(Species ~ Petal.Length + Sepal.Length, data = iris)

Probability distributions over the class labels

library(dplyr)  # provides sample_n() and bind_cols()
sampled <- sample_n(iris, 7)
predictions <- predict(ldam, newdata = sampled)
bind_cols(sampled['Species'], as.data.frame(predictions$posterior))


     Species setosa versicolor virginica
  versicolor      0      0.998     0.002
   virginica      0      0.004     0.996
  versicolor      0      0.921     0.079
   virginica      0      0.000     1.000
      setosa      1      0.000     0.000
  versicolor      0      0.951     0.049
   virginica      0      0.496     0.504

Comparing DA and logistic regression

Discriminant analysis

  • In DA we model how the probability distribution function of each class depends on the variables
  • From these individual PDF’s we can calculate the likelihood of a case belonging to each of the classes

Logistic regression

  • Directly models the probability of a case belonging to each of the classes

Discriminative vs Generative models

  • Logistic regression is a discriminative type of modeling
  • Discriminant analysis is a generative type of modeling

When DA out-performs Logistic regression

A pathological case for logistic regression

  • Well-separated groups cause problems
  • The algorithm does not converge because the parameters cannot be estimated
  • Both curves in the figure have the same likelihood up to machine accuracy
    • The optimal likelihood cannot be determined

The naive Bayes classifier

Using Bayes rule as a classifier

Variables:

  • \(M\): mitochondrial signal in the protein: \(y,n\) (yes, no)
  • \(L\): location of the protein in the cell, possible values: \(c,m\) (cytoplasm, mitochondrion)

Challenge:

  • We have assessed the conditional distribution \(P(\text{M}|\text{L})\)
    • \(P(M = y|L = m) = 0.8\), \(P(M = y|L = c) = 0.15\).
  • A protein is randomly picked from a population with the distribution \(P(L=m) = 0.1\)
  • The protein drawn carries a mitochondrial signal: \(M = y\)
  • Predict the location of this protein

Using Bayes rule as a classifier

Plan:

Compare \(P(L=m|M=y)\) to \(P(L=c|M=y)\)

Decision rule

\(L=m\) if \(P(L=m|M=y) > P(L=c|M=y)\), otherwise \(L=c\)

\[ \begin{align} P(L=m|M=y) &= \frac{P(M=y|L=m) \cdot P(L=m)}{P(M=y)} \\ &= \frac{0.8 \times 0.1}{P(M=y)} = \frac{0.08}{P(M=y)} \end{align} \]

\[ P(L=c|M=y) = \frac{0.135}{P(M=y)} \quad \text{(Show this)} \]

We can now decide that \(L = c\). We don’t have to know \(P(M=y)\)!
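The whole comparison fits in a few lines of R, using the numbers from the slide; the normalizing constant \(P(M=y)\) is never computed:

```r
prior_m <- 0.1;  prior_c <- 0.9    # P(L = m), P(L = c)
lik_m   <- 0.8;  lik_c   <- 0.15   # P(M = y | L = m), P(M = y | L = c)

post_m <- lik_m * prior_m          # 0.08  (unnormalized posterior)
post_c <- lik_c * prior_c          # 0.135
ifelse(post_m > post_c, "mitochondrion", "cytoplasm")   # "cytoplasm"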

We have another, continuous predictor for protein location

Previously we needed to know the discrete distribution \(P(M|L)\).

What if \(S\) is a continuous variable and we need \(p(S|L)\)?

  • Assess \(S\) in proteins with known location
  • Fit the data using a suitable distribution function
  • Obtain two conditional probability distribution functions, one for mitochondrial and one for cytoplasmic proteins

Using Bayes rule on a continuous predictor

  • The protein that we picked earlier has a value \(S = 22\)
  • What is the location of this protein?
  • Compare \(p(L=c | S=22)\) to \(p(L=m | S=22)\)

\[ \begin{align} p(L = c | S = 22) &\propto p(S=22 | L = c) \times P(L = c) \\ &\propto 0.00769 \times 0.9 = 0.00692 \end{align} \]

\[ \begin{align} p(L = m | S = 22) &\propto p(S = 22 | L = m) \times P(L = m) \\ &\propto 0.0225 \times 0.1 = 0.00225 \end{align} \]

We decide (again) that \(L = c\)

Intermezzo: probability density is not a probability

Note that we are sloppy in the previous slide.

\(p(S|L)\) is a probability density function (of \(S\)), not a probability.

However, in a very tiny interval \(S = 22 \pm \delta/2\)

\[ P\left( S = 22 \pm \delta/2 | L\right) \approx p(S=22|L) \times \delta \]
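Numerically, with a hypothetical normal density (mean 20, sd 3) standing in for \(p(S|L)\):

```r
delta  <- 0.01
exact  <- pnorm(22 + delta/2, 20, 3) - pnorm(22 - delta/2, 20, 3)
approx <- dnorm(22, 20, 3) * delta
c(exact = exact, approx = approx)   # nearly identical for small delta
```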

Using Bayes rule with both predictors

By definition

\[ p(M|L) = \frac{p(M,L)}{p(L)} \quad \text{and} \quad p(S|M,L) = \frac{p(L,M,S)}{p(M,L)} \]

From which we get \(\qquad p(L,M,S) = p(S|M,L) \cdot p(M|L) \cdot p(L)\)

Also, by definition

\[ p(L|M,S) = \frac{p(L,M,S)}{p(M,S)} \]

By substitution we obtain:

Bayes rule for joint distributions

\[ p(L|M,S) = \frac{p(S|M,L) \cdot p(M|L) \cdot p(L)}{p(M,S)} \]

But how can we estimate \(p(S|M,L)\)?

The problem with multiple predictors

  • \(p(S|M,L)\) is a set of 4 probability density functions (4 combinations in the Cartesian product \(M \times L\))
  • In case of more predictor variables the dimensions of \(p(X_1|X_2, X_3,\ldots,X_n,L)\) explode

Solution: the naive assumption

  • All predictor variables are independent conditional on \(L\)
  • For example \(p(S|M,L) = p(S|L)\): \(S\) does not depend on \(M\), in the sub-populations of \(L\)

Applying the naive assumption we get

\[ p(L|M,S) = \frac{p(S|L) \cdot p(M|L) \cdot p(L)}{p(M,S)} \]

In case of our example

Compare \(p(L=m|S=22,M=y)\) to \(p(L=c|S=22,M=y)\)

\[ \begin{align} & p(L = m | S=22,M = y) \\ & \propto p(S=22|L = m) \times p(M = y | L = m) \times p(L = m) \\ & = 0.0225 \times 0.8 \times 0.1 = 0.0018 \end{align} \]

\[ \begin{align} & p(L = c | S = 22, M = y) \\ & \propto p(S=22|L = c) \times p(M = y | L = c) \times p(L = c) \\ & = 0.00769 \times 0.15 \times 0.9 = 0.00104 \end{align} \]

Given the combined information we decide that \(L=m\)
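In R, with the density and probability values quoted on the slides:

```r
prior_m <- 0.1;    prior_c <- 0.9
dens_m  <- 0.0225; dens_c  <- 0.00769   # p(S = 22 | L)
lik_m   <- 0.8;    lik_c   <- 0.15      # P(M = y | L)

post_m <- dens_m * lik_m * prior_m      # 0.0018
post_c <- dens_c * lik_c * prior_c      # about 0.00104
ifelse(post_m > post_c, "mitochondrion", "cytoplasm")   # now "mitochondrion"
```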

Causal scheme for the naive assumption

\[M \perp S|L\]

  • For a given location, S and M are independent variables
  • All correlation between S and M is explained by the location


Naive assumption in general

Every pair of predictors is independent conditional on the response


Conclusion

  • Despite (because of?) the naive assumption, naive Bayes classifiers often perform extremely well
  • See example on the iris data set in the syllabus
  • Prior distributions can be modified when desired (here \(p(L)\))
  • They are used a lot as a simple type of classifier
  • Only use it as a classifier, not to predict distributions over classes!